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Abstract 

The  familiar  Bayesian  framework,  where  observed  data  is  used  to  update  prior 
information,  via  Bayes’s  formula,  has  many  desirable  features.  This  project  aims  to 
address  shortcomings  of  this  Bayesian  approach  in  two  essential  problems,  namely, 
prediction  and  inference.  First,  for  the  prediction  problem,  the  Monte  Carlo  com¬ 
putation  required  to  obtain  a  genuine  Bayesian  predictive  distribution  can  be  too 
slow  for  use  with  streaming  data,  and  a  new  recursive  estimator  of  the  Bayesian 
predictive  distribution  is  proposed  which  is  both  fast  to  compute  and  has  desirable 
theoretical  properties.  Second,  for  the  inference  problem,  there  are  cases  where  a 
full  probability  model  for  all  the  unknowns  is  not  available  and/or  is  not  desirable, 
so  there  is  a  need  for  “likelihood-free”  Bayesian  inference.  New  tools  are  developed 
to  address  various  theoretical  and  computational  questions  related  to  the  use  of 
so-called  Gibbs  models  for  such  problems. 


1  Statement  of  the  problems 

1.1  Problem  1:  prediction 

Consider  the  problem  where  data,  Y\, ...  ,Yn, .. .  are  iid  from  some  distribution  with  a 
density  function  p.  After  observing  R, . . . ,  Yn ,  the  goal  is  to  predict  the  next  observation 
Yn+1.  More  specifically,  we  desire  a  rule  that  converts  observations  (R , . . . ,  Yn)  into  a 
density  function  pn  which  is,  in  some  sense,  our  “best  guess”  of  the  distribution  of  Yn+  j . 
This  object  is  usually  called  a  predictive  density.  A  standard  way  to  construct  a  predictive 
distribution  is  by  taking  a  Bayesian  approach.  That  is,  start  with  a  prior  distribution  II 
for  the  unknown  density  p  supported  on  a  subset  P  of  all  density  functions.  Then  the 
Bayesian  approach  constructs  a  posterior  distribution  for  p,  given  (bj , . . .  ,Yn),  denoted 
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by  nn,  that  satisfies 


n  n{B) 


JB  rE‘.iP(>i)n(dp) 

/p  n?_.  p(V0  n(dp)  ’  -  ' 


This  allows  for  a  natural  construction  of  a  predictive  distribution.  Given  the  posterior 
IIn,  define  the  predictive  density  pn  for  Yn+ i,  given  (Yi, . . .  ,Yn),  as  the  posterior  mean 
density,  i.e., 

Pn(y)  =  I  p(y )  n n(dp). 

This  Bayesian  predictive  distribution  is  logically  quite  reasonable  and,  except  for  some 
rather  unusual  cases,  will  have  good  theoretical  properties.  The  downside,  however,  is 
that,  for  standard  choices  of  prior  II,  such  as  a  Dirichlet  process  mixture  model,  the  Monte 
Carlo  computations  required  to  evaluate  pn  are  somewhat  costly.  More  concerning  is  that 
updating  pn  to  pn+i  when  Yn+ 1  is  observed  requires  that  one  redo  all  the  Monte  Carlo 
computations,  so  online  Bayesian  prediction  is  not  possible.  The  first  goal  of  this  project 
is  to  develop  a  recursive  approximation  to  the  Bayesian  predictive  update,  one  that  does 
not  require  Monte  Carlo  computations,  and  has  desirable  theoretical  properties. 


1.2  Problem  2:  inference 

Consider  the  same  general  setup  as  in  the  previous  section,  but  now  the  main  interest  is 
in  some  feature  9  =  8(p)  of  the  distribution  p.  Here  9  could  be  a  finite-dimensional  vector, 
e.g.,  a  set  of  moments,  or  could  be  a  function,  like  in  nonparametric  regression.  From  a 
non-Bayesian  point  of  view,  depending  on  the  setup,  it  may  be  possible  to  produce  an 
estimator  or  other  kinds  of  inference  for  9  without  specifying  a  model  for  p.  For  example, 
one  can  easily  construct  an  estimator  and  an  asymptotically  correct  confidence  interval 
for  a  quantile  without  specifying  a  model  for  the  underlying  distribution  p.  A  Bayesian 
approach,  on  the  other  hand,  requires  a  likelihood  function  which,  in  turn,  requires  a 
model  for  p.  There  is  no  conceptual  difficulty  to  specify  a  prior  n  for  p,  get  a  posterior 
nn  based  on  data  (bj , . . . ,  Yn),  and  then  get  a  corresponding  posterior  distribution  for  9 
via  marginalization.  The  question  is  why  spend  the  time  and  resources  to  specify  a  prior 
distribution  for  the  full  distribution  p  and  carry  out  the  potentially  non-trivial  Monte 
Carlo  computations  to  evaluate  the  posterior  distribution  of  p,  when  the  intention  is  to 
marginalize  over  everything  but  91  A  second  goal  of  this  project  is  to  develop  theoretical 
and  computational  tools  that  will  allow  users  to  carry  out  a  Bayesian  analysis  working 
only  on  the  interest  parameter  9,  avoiding  the  unwanted  tasks  of  prior  specification  and 
Monte  Carlo  computations  over  the  nuisance  parameter  space. 


2  Summary  of  main  results 

2.1  Problem  1:  prediction 

2.1.1  Background 

An  important  special  case  of  the  general  setup  in  Section  1.1  is  that  of  a  Dirichlet  process 
mixture  model  for  the  density  p\  see,  for  example,  Lo  (1984)  and  Escobar  and  West 
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(1995).  That  is,  the  prior  II  models  the  density  p  as  a  mixture 

J  K(y  |  u )  G(du ), 

where  K ( y  \  u )  is  a  kernel,  and  the  mixing  distribution  G  is  modeled  by  a  Dirichlet  process 
distribution  (Ferguson  1973).  By  now  there  is  a  substantial  literature  on  Monte  Carlo 
methods  to  fit  this  model  and  compute  the  corresponding  predictive;  see  MacEachern 
and  Muller  (1998),  Neal  (2000),  Walker  (2007),  and  Kalli  et  al  (2011).  However,  these 
methods  do  not  allow  for  a  mapping  from  the  previous  predictive  and  a  new  observation, 
(pn_i,K„),  to  the  updated  predictive,  pn.  The  burden  of  requiring  Monte  Carlo  methods 
motivated  Newton  and  Zhang  (1999)  and  Newton  (2002)  to  consider  an  approximation 
of  the  full  Dirichlet  process  mixture  model,  called  predictive  recursion,  which  produced 
a  recursive  estimator  of  the  mixing  measure  G.  Extensive  study  of  Newton’s  recursive 
estimator  is  carried  out  in  Tokdar  et  al  (2009)  and  Martin  and  Tokdar  (2009,  2011). 
This  estimator  is  fast  and  easy  to  compute  but,  despite  its  name,  it  does  not  directly 
target  the  predictive  density — numerical  integration  is  needed  to  evaluate  normalizing 
constants,  etc.  The  goal  here  is  to  develop  a  version  of  Newton’s  method  that  directly 
targets  the  predictive  and,  therefore,  does  not  require  numerical  integration.  Interestingly, 
to  accomplish  this  goal,  we  will  need  to  make  unexpected  use  of  copulas. 


2.1.2  Results 


As  a  candidate  for  the  predictive  update,  one  with  a  multiplicative  structure,  consider  a 
bivariate  function  k(y,y')  that  satisfies 

pn(y)  =  Pn-i(y)k(y,Yn). 


Therefore, 


k(y,  Yn) 


which  is  symmetric  in  (■ y,Yn ),  since 


Pn(y) 

Pn-l(y) 


k(y,  Yn) 


J  p(y)p(Yn)Un-i(dp) 
f  p(y)  Tln-i{dp)  f  p(Yn )  nn_i (dp)  ’ 


(1) 


Hahn  et  al  (2015)  show  that  the  function  k(y,Yn)  in  (1)  is  a  bivariate  copula  density 
function  (Nelsen  1999);  that  is,  for  some  symmetric  copula  density  cn,  which  depends  on 
the  sample  only  through  the  sample  size,  we  have 


k(y,  Yn)  =  cn(Pn-1(y),  P„_i(W))  (2) 

where  cn(u,v)  =  cn(v,u)  is  a  symmetric  copula  density,  and  Pn-i  is  the  distribution 
function  corresponding  to  the  predictive  density  pn-\. 

We  can  now  write  the  update  (pn-i,yn)  ^  Pn  as 


Pn(y)  =  Cn{Pn-l{y),Pn-l{yn))Pn-l{y) 


(3) 


and  for  each  Bayesian  model  there  is  a  unique  sequence  cn.  Now  (3)  allows  for  the 
direct  update  of  the  predictive  and  moreover  it  can  be  seen  that  all  one  needs  to  direct  a 
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sequence  of  predictive  densities  is  to  define  a  sequence  of  copula  functions  cn,  the  key  to 
which  is  that  cn  — >  1  as  n  — >  oo,  i.e.,  the  sequence  of  copula  converges  to  the  independent 
copula  as  the  sample  size  increases. 

Unfortunately,  it  is  only  possible  to  write  down  the  corresponding  copula  density  cn 
for  relatively  simple  parametric  models.  We  can,  however,  follow  Newton’s  approach 
and  build  a  recursive  approximation  based  on  what  is  known  about  the  Dirichlet  process 
mixture  model.  In  particular,  fix  an  initial  guess  P0,  with  density  p0,  a  sequence  of  weights 
(an)  C  (0, 1),  and  a  correlation  parameter  p  e  (0, 1).  Then,  sequentially  compute 

Pn(y)  =  (1  -  oin)pn-i(y)  +  anpn-i(y)  hp(Pn-i(y),Pn-i(Yn)),  n  >  1.  (4) 

where  hp  is  the  Gaussian  copula  density 

N^-V),*-1^)  |0,l,p) 

p[U,L  ’  N($-1(m)  |  0,1)N(4>^)  I  0,1)' 

Under  some  mild  regularity  conditions  on  the  true  density,  p*,  if  the  weight  sequence 
(an)  satishes  ^2no:n  =  oo  and  J2nan  <  oo,  then  the  recursive  predictive  pn  in  (4)  is  a 
consistent  estimator,  i.e.,  pn  — >  p *  almost  surely,  as  n  — >  oo,  with  respect  to  the  Li  norm. 
In  addition  to  the  theoretical  consistency  results,  numerical  simulations  reveal  that  the 
recursive  estimator  is  fast  and  easy  to  compute  and  also  highly  accurate. 

2.1.3  Further  developments 

The  paper  (Hahn  et  al  2015)  that  contains  these  results  is  currently  under  revision  for 
the  Journal  of  the  American  Statistical  Association  and,  though  we  cannot  be  sure  of  the 
outcome  at  this  point,  we  expect  that  it  will  ultimately  be  accepted  there.  As  part  of  our 
revision,  we  intend  to  consider  some  possible  extensions  of  the  results  summarized  above, 
including  some  dependent-data  cases,  as  well  as  some  multivariate  prediction  problems. 
In  part  through  the  work  on  recursive  density  estimation  for  this  project,  Stephen  Walker 
and  I  developed  some  new  ideas  for  solving  classical  inverse  problems,  such  as  solving 
Fredholm  equations,  using  statistical  ideas/methods.  We  recently  learned  that  this  new 
project  will  be  supported,  in  part,  by  the  National  Science  Foundation. 

2.2  Problem  2:  inference 

2.2.1  Background 

There  are  a  number  of  statistical  inference  problems  that  are  not  generally  formulated  via 
a  full  probability  model.  Perhaps  the  most  important  example  of  these  is  when  the  goal  is 
inference  on  quantiles,  especially,  quantile  regression.  The  usual  non-Bayesian  approach 
to  quantile  regression,  as  discussed  in  Koenker  (2005),  formulates  the  problem  as  one  of 
optimization,  i.e.,  minimize  a  measure  of  the  empirical  risk,  and  there  is  a  rich  theory 
of  M-estimation  on  which  to  build  this  framework.  A  Bayesian  approach,  on  the  other 
hand,  is  less  straightforward  because  one  is  used  to  building  a  posterior  distribution  based 
on  a  prior  and  a  likelihood,  but  here  there  is  no  likelihood.  Bayesian  quantile  regression 
has  received  some  attention  in  the  literature,  e.g.,  Yu  and  Moyeed  (2001)  and  Srirarn  et 
al  (2013),  and  these  also  make  use  of  the  same  empirical  risk,  but  the  formulation  is  in 
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terms  of  a  misspecified  likelihood.  This  perspective  does  not  make  clear,  however,  how 
to  approach  other  problems  when  no  likelihood  is  present. 

A  concrete  motivating  example  for  this  work  comes  from  a  medical  application  in  He- 
dayat  et  al  (2015).  One  objective  of  a  clinical  trial  is  assessing  the  efficacy  of  a  treatment, 
but  statistical  significance  alone  does  not  necessarily  imply  efficacy.  For  instance,  a  study 
with  high  power  may  detect  statistically  significant  differences  that  do  not  translate  to 
practical  differences  noticeable  by  the  patients.  As  a  result,  a  cutoff  value  different  than 
a  statistical  critical  value  is  desired  that  would  separate  patients  with  clinically  signifi¬ 
cant  responses  from  those  patients  without  a  clinically  significant  response.  This  cutoff 
is  deemed  the  “minimum  clinically  important  difference”  (MCID). 

Let  Y  e  {  —  1, 1}  denote  the  patient  reported  outcome  with  1  meaning  that  the  treat¬ 
ment  was  effective  and  —1  meaning  that  the  treatment  was  not  effective.  Let  X  be  a 
continuous  diagnostic  measure  taken  on  each  patient.  The  MCID,  denoted  by  9 *,  satisfies 

P{Y  7^  sign  (A  —  9)}  =  min  P{Y  ^  sign  (A  —  d)}, 
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i.e.,  9  minimizes,  over  9,  the  probability  that  sign(A  —  9*)  disagrees  with  Y.  Here,  P  is 
with  respect  to  the  joint  distribution  of  (A,  Y).  Since 

P{Y  +  sign(x  -  «)}  =  E{1~yS1S”(A'-')>}, 

and  the  latter  can  be  interpreted  as  an  expected  loss,  the  MCID  is  understood  as  an 
expected  loss  minimizer,  as  discussed  above.  Note  that  the  MCID  is  defined  without 
a  model  for  the  distribution  of  (A,  Y),  but  estimation  can  proceed  without  it;  in  fact, 
Hedayat  et  al  (2015)  proceed  to  estimate  the  MCID  by  minimizing  an  empirical  version 
of  the  above  expression: 

Dm.  (5) 

T )  Zj 

i=1 

How  could  one  solve  this  problem  from  a  Bayesian  perspective?  One  could  introduce  a 
probability  model,  and  priors  for  the  model  parameters,  and  compute  the  corresponding 
posterior,  but  the  results  surely  would  depend  on  the  particular  choice  of  model,  which 
may  not  be  correct.  Moreover,  the  probability  model  likely  would  depend  on  other 
nuisance  parameters,  not  just  on  9  alone,  and  this  complicates  the  problem  even  further. 
It  would  be  desirable  to  produce  a  posterior  distribution  for  9  directly,  without  worrying 
about  if  the  model  is  wrong  or  dealing  with  nuisance  parameters. 

Bissiri  et  al  (2016)  recently  showed  that  a  logical  Bayesian  prior-to-posterior  update 
can  be  carried  out  in  cases  where  only  a  loss  function,  and  not  a  likelihood,  is  available. 
Their  update  corresponds  to  the  so-called  Gibbs  posterior  which  has  been  used  occasion¬ 
ally  in  the  statistics  and  machine  learning  literature.  In  particular,  suppose  the  inference 
problem  is  defined  by  a  loss  function  that  admits  a  corresponding  empirical  risk  function 
Rn{9)  like  that  in  (5).  Then  the  Gibbs  posterior  distribution  satisfies 

Un(d9)  (xe~wnRn(6)U(d9),  (6) 

where  n  is  a  prior  distribution  for  9  and  w  >  0  is  a  scale  parameter.  Note  that  the  Gibbs 
posterior  in  (6)  involves  no  nuisance  parameters,  so  investments  in  prior  specification 
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and  posterior  computations  can  be  focused  on  the  parameter  of  interest.  Compare  this 
to  standard  Bayesian  methods  that  require  a  prior  and  posterior  for  everything.  Although 
the  Gibbs  posterior  does  depend  on  w,  it  often  maintains  good  theoretical  properties  over 
a  range  of  w  values,  and  its  finite-sample  performance  may  even  be  improved  by  using 
data- dependent  choices  of  w,  see  Syring  and  Martin  (2015ab). 

2.2.2  Results 

There  are  three  main  issues  addressed  in  this  project.  First,  to  see  the  benefit  of  the 
Gibbs  approach,  important  examples  need  to  be  identified.  Second,  with  motivation 
coming  from  some  interesting  and  important  applications,  what  are  the  general  properties 
that  Gibbs  posterior  distribution  would  satisfy,  in  particular,  asymptotic  concentration 
properties?  Third,  there  will  be  computational  challenges  to  be  overcome  in  order  to 
implement  the  proposed  methodology.  In  what  follows,  we  will  summarize  the  results 
obtained  on  each  of  these  points. 

In  terms  of  applications,  the  two  previously  mentioned,  namely,  MCID  and  quantile 
regression,  are  important  ones,  and  these  are  addressed  specifically  in  Syring  and  Martin 
(2015ab).  The  MCID  problem  is  essentially  a  classification  problem,  so  there  are  a 
variety  of  extensions  of  this  that  would  be  of  interest.  In  particular,  Hedayat  et  al  (2015) 
investigate  an  interesting  personalized  MCID  problem  where  the  MCID  is  actually  a 
function  of  some  patient-specific  characteristics;  some  preliminary  work  on  a  Bayesian 
approach  to  this  problem  has  been  completed,  but  more  details  remain  to  be  filled  in. 
There  are  a  few  other  interesting  applications  that  we  have  started  working  on,  and  we 
discuss  these  in  more  detail  in  Section  2.2.3  below. 

In  terms  of  general  theory,  we  have  shown  in  Syring  and  Martin  (2015b)  that,  under 
very  general  conditions,  the  Gibbs  posterior  distribution  will  concentrate  around  the 
true  6*  at  the  same  rate  as  the  corresponding  M-estimator,  i.e.,  the  estimator  obtained 
by  minimizing  the  empirical  risk  function  Rn{6).  In  addition,  we  have  shown  that  any 
reasonable  choice  of  the  scale  w ,  even  one  that  depends  suitably  on  n,  will  not  affect 
the  Gibbs  posterior  concentration  rate.  Extensions  of  these  concentration  rate  results  to 
handle  adaptivity  are  currently  being  completed;  see  Section  2.2.3  below. 

In  terms  of  computation,  there  are  several  issues.  In  simple  finite-dimensional  prob¬ 
lems,  such  as  MCID  or  quantile  regression,  it  is  relatively  straightforward  to  compute 
the  Gibbs  posterior  using  the  standard  Metropolis-Hastings  machinery.  On  the  other 
hand,  in  nonparametric  problems  with  infinite-dimensional  parameters,  things  are  not  so 
easy.  Indeed,  the  standard  Bayesian  computational  methods  employed  in  these  problems 
often  take  advantage  of  some  structure — conjugacy  or  otherwise — to  simplify  things  but, 
of  course,  the  empirical  risk  function  is  unlikely  to  have  any  such  structures.  We  have 
been  addressing  this  issue  case-by-case  so  far,  but  progress  has  been  made. 

Finally,  one  last  point  that  fits  in  with  all  three  of  the  categories  mentioned  above 
is  the  choice  of  the  scaling  parameter  w  in  (6).  It  does  not  have  an  effect  on  the  Gibbs 
posterior  concentration  rate,  but  it  does  have  an  effect  on  the  finite  sample  performance. 
What  criterion  should  be  used  to  make  a  good  choice  of  wl  In  Syring  and  Martin  (2015b), 
we  proposed  to  choose  w  such  that  the  corresponding  Gibbs  posterior  credible  regions  are 
calibrated,  i.e.,  so  that  the  credible  regions  are  also  confidence  regions.  The  algorithm 
to  make  this  choice  uses  a  combination  of  bootstrap  and  stochastic  approximation.  Sur- 
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prisingly,  this  choice  of  w  leads  to  exact  nominal  coverage  probability  in  our  simulation 
examples,  for  each  fixed  sample  size,  not  just  asymptotically. 

2.2.3  Further  developments 

The  papers  listed  above  (Syring  and  Martin  2015ab)  are  both  currently  under  revision, 
and  we  expect  to  get  these  back  into  the  review  pipeline  shortly.  The  results  are  good  but, 
so  far,  it  seems  we  have  not  been  able  to  make  clear  what  are  the  main  contributions.  In 
addition  to  these  two  papers,  we  have  one  more  that  is  near  completion.  This  latter  paper 
considers  the  problem  of  nonparametric  estimation  of  the  boundary  of  an  image  based 
on  noisy  intensity  measurements  at  pixels.  This  is  a  particularly  challenging  problem 
for  the  classical  Bayesians  because  they  have  to  specify  the  intensity  distribution  inside 
and  outside  the  image.  This  introduces  nuisance  parameters  and  potential  bias  if  the 
models  are  misspecihed.  He  have  developed  a  Gibbs  posterior  approach  that  is  easy  to 
compute — based  on  a  suitable  reversible  jump  Markov  chain  Monte  Carlo  method — and 
enjoys  adaptive  optimal  concentration  rates.  We  are  particularly  excited  about  this  work, 
and  we  expect  it  will  be  completed  by  the  end  of  May  2016. 

There  are  many  other  potential  applications  to  be  explored  within  this  Gibbs  model 
framework.  One  that  has  caught  my  attention  is  in  the  Levy  process  models  that  often 
appear  in  financial  applications,  i.e. ,  models  for  continuous-time/streaming  data  that 
allow  for  the  possibility  that  the  sample  paths  have  jump  discontinuities.  For  such  models, 
often  the  quantity  of  interest  is  the  Levy  density,  which  is  what  controls  the  size  and 
frequency  of  jumps.  Even  if  the  model  is  fully  specified,  neither  writing  down/evaluating 
the  likelihood  function  nor  simulating  the  data-generating  process  is  straightforward,  so 
Bayesian  methods  have  not  been  used.  A  Gibbs  model,  however,  is  fairly  straightforward 
to  implement  and  can  directly  target  the  Levy  density,  avoiding  nuisance  parameters, 
etc.  This  is  an  exciting  problem  and  I  hope  to  make  some  progress  this  summer. 

3  Student  training 

This  STIR  award  provided  support  for  one  of  my  students,  Nick  Syring,  during  Summer 
and  Fall  of  2015.  The  work  completed  while  Nick  was  supported  by  the  ARO  will  make 
up  a  significant  portion  of  his  doctoral  thesis.  I  have  taken  a  new  job  at  North  Carolina 
State  LIniversity,  starting  in  Fall  2016,  and  Nick  will  be  joining  me  there.  His  thesis  will 
surely  acknowledge  the  gracious  support  from  the  ARO. 
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